251 Mercer Street

New  York,    N.    Y.       10012 

ABSTRACT 

The problem of choosing a theoretical abstract model of parallel
computation to be simulated by an exclusive-read exclusive-write
parallel RAM or a synchronous distributed machine is discussed. The
principle of choosing the most permissive model of parallel
computation as long as the cost of computational resources does not
increase is applied.

We  prove  two  theorems.  The  first  (resp.  second)  theorem  asserts 
that  for  every  exclusive-read  exclusive-write  parallel  RAM  (resp. 
synchronous  distributed  machine)  and  every  simulation  of  the 
concurrent-read exclusive-write parallel RAM model of computation into
this  machine,  there  exists  a  simulation  of  the  Fetch-and-Add  parallel 
RAM  into  the  same  machine  that  uses  the  same  order  of  computational 
resources.  It  implies  the  choice  of  a  Fetch-and-Add  parallel  RAM  model 
of    computation. 

A promising methodological notion, called Execution Reducibilities,
is exercised in the proof of the main theorems.


1.    INTRODUCTION 

The understanding that algorithms for parallel machines will be
designed for an abstract model of computation (design space) to be
automatically simulated by a machine is shared by many researchers. A
similar approach is actually exercised in available computers where
users write algorithms in high-level languages. These algorithms are
translated by the machine into atomic instructions that the machine can
execute. This understanding served as a guideline for two
general-purpose parallel computers that have been proposed: the
Ultracomputer [Schwartz 80] and the Parallel-Design
Distributed-Implementation (PDDI) [Vishkin 82]. This understanding is
also implied (explicitly or implicitly) by most of the many papers that
contain algorithms which are designed for abstract models of parallel
computation.

The present paper falls mainly within the following abstract
scheme. Suppose that a model M of synchronous parallel computation
that allows some operations in any cycle, but forbids others, is given.
Assume that it is possible to simulate every cycle of any algorithm in
M, within certain resource requirements, into some weaker model of
computation D that represents a technologically available machine. We
are interested in the question: Is it possible to allow in cycles of M
some of the forbidden operations and still get a simulation of each
cycle into D, within about the same resource requirements?

The advantages of giving an affirmative answer to the above
question are the following. Imagine that a programmer wrote
programs for model M to be later translated by the (aforementioned)
simulation into D. Now, the person has more powerful primitives in his
programming language (the language describing model M). Therefore, we
enable simpler and more efficient (in terms of composition of
primitives) algorithms to be designed and programs to be written.

Actually, we do something more general than that. Imagine that
instead of one specific simulation the aforementioned abstract scheme
refers to a large family of simulations S and the assumption holds for
every s in S, with resource requirements which depend on s. We manage
to answer the question in the affirmative for all s in S at once. The
family S is specified in a way that will include 'reasonably defined'
simulations that may be proposed.

Let us be more specific. We prove two theorems. In one the model
of computation D (which represents a machine) is the exclusive-read
exclusive-write (EREW) PRAM. It is a synchronous model of parallel
computation (due to [Lev, Pippenger and Valiant 81]) in which all
processors have access to a shared memory. No simultaneous access of
more than one processor to the same memory location is allowed. In the
second theorem the model of computation D is a synchronous distributed
model of computation, also called synchronous distributed computer (SDC)
(it is used in [Galil and Paul 81] and [Vishkin 82]). In this model of
computation each processor may communicate directly with at most c
others, where c is a 'small' constant. This 'limited valence' restriction
seems to be the one and only restriction which is imposed by current
technology. The model of computation M is the concurrent-read
exclusive-write parallel RAM (CREW PRAM). It is a synchronous model of
parallel computation (due to [Fortune and Wyllie 78]) in which all
processors have access to a shared memory. Simultaneous access to the
same memory location is allowed for read purposes but not for write
purposes. Generally, we show that given a simulation of the CREW PRAM
into the EREW PRAM (resp. SDC) it is possible to modify it into a
simulation of the Fetch-and-Add PRAM (F&A PRAM) into the EREW PRAM
(resp. SDC) within the 'same' time and common and local memory
requirements of the EREW PRAM (resp. time and local memory
requirements of the SDC). The F&A PRAM allows everything that the CREW
PRAM allows. In addition to that the following instruction is assumed
to take one cycle of the F&A PRAM: say that k processors $i_1, \dots, i_k$
execute simultaneously the instructions F&A($A, e_1$), ..., F&A($A, e_k$),
respectively, where A is a common memory address and $e_j$ is some
address in the local memory of processor $i_j$, for $1 \le j \le k$. The
result will be as follows:

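1. $A \leftarrow A + e_1 + e_2 + \dots + e_k$.

2. For some permutation $\sigma$ of $\{1, 2, \dots, k\}$ (which is not known in
advance) the k sums

$$A,\quad A + e_{\sigma(1)},\quad A + e_{\sigma(1)} + e_{\sigma(2)},\quad \dots,\quad A + e_{\sigma(1)} + \dots + e_{\sigma(k-1)}$$

are stored in local registers of processors $i_{\sigma(1)}, i_{\sigma(2)}, \dots, i_{\sigma(k)}$,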
respectively. See [Gottlieb, Lubachevsky and Rudolph 81] and [Rudolph
81] for a demonstration of the effectiveness of this F&A primitive with
respect to the design of parallel algorithms and operating systems.
This primitive is available in the NYU-Ultracomputer general-purpose
parallel computer ([Gottlieb et al. 82]). The present paper supports
augmenting this primitive on general theoretical grounds. Namely, the
decision reflects an impact of the physical limitations of distributed
machines on the theory of parallel computation and is justified
regardless of the choice of the interconnection network for the
distributed machine.
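The effect of simultaneous F&A instructions on one address can be illustrated by a minimal sequential sketch. It assumes that the increments are plain integers and that the serialization order is passed in explicitly as `order` (in the model itself this permutation is not known in advance); the names `fetch_and_add_pulse`, `shared` and `order` are hypothetical and chosen for illustration only.

```python
def fetch_and_add_pulse(shared, address, increments, order):
    """Serialize simultaneous F&A(address, e_j) requests in the given order.

    `order` plays the role of the permutation sigma. Returns the value each
    processor receives, indexed like `increments`.
    """
    received = [None] * len(increments)
    for j in order:
        received[j] = shared[address]     # processor j fetches the current content
        shared[address] += increments[j]  # then its increment is added to the cell
    return received


shared = {"A": 10}
received = fetch_and_add_pulse(shared, "A", [3, 5, 2], order=[1, 0, 2])
print(received, shared["A"])
```

Whatever order is chosen, the final content of the cell is the old content plus the sum of all increments; only the distribution of the intermediate sums among the processors depends on the order.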

The effort of replacing the CREW PRAM by the F&A PRAM model of
computation is justified by the following. [Cook and Dwork 82]
prove a lower bound of $\Omega(\log n)$ for computing the OR of n bits on the
CREW PRAM. This, however, can easily be done in constant time using one
F&A instruction. [Stockmeyer and Vishkin 82] (using a result of [Furst,
Saxe and Sipser 81]) implied that, even if some conventions for resolving
write conflicts are allowed, it is still impossible to compute the parity
of n bits in constant time by a polynomial number of processors. (Models
of computation with such conventions are called CRCW PRAM. These models
and ways to implement them in the F&A PRAM or possible extensions of
the F&A PRAM are mentioned in the last section.) An F&A PRAM can
readily compute this parity problem in constant time with n processors.
[Vishkin and Wigderson 82] obtained lower bounds of $\Omega(\sqrt{n/m})$ for the OR
(resp. parity) of n bits in the CREW (resp. CRCW) PRAM where the
input is in a read-only shared memory and the read/write shared memory
is of size m.
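One way to see why a single F&A pulse suffices for OR and parity is sketched below. The sketch rests on an assumption not spelled out in the text: every processor adds its own input bit to one common cell that initially holds 0, and the cell is inspected afterwards. The loop only stands in for the simultaneous pulse; the helper names are hypothetical.

```python
def fetch_and_add(shared, address, increment):
    """Sequential stand-in for one processor's F&A instruction."""
    old = shared[address]
    shared[address] = old + increment
    return old


def or_and_parity(bits):
    """Return (OR, parity) of the bits using one F&A per processor."""
    shared = {"A": 0}
    for bit in bits:  # conceptually all n instructions belong to one F&A pulse
        fetch_and_add(shared, "A", bit)
    total = shared["A"]  # the number of ones among the input bits
    return int(total > 0), total % 2


print(or_and_parity([1, 0, 1, 1]))
```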

The presentation starts by showing that, generally, given a
simulation of a model of computation similar to the CREW PRAM into the
EREW PRAM, it can be modified efficiently into a simulation of an F&A
PRAM into the EREW PRAM. This idea of modifying a simulation
(algorithm or computational process) to form another computational
process, called execution reducibilities (or compositions), is being
studied in [Vishkin 83]. There, several applications of this notion
are being examined. They relate to examples taken from [Megiddo 81],
[Vishkin 82] and this paper, as well as semantics of programs and the
theory of NP-completeness. The wide notion of execution
reducibilities, which is formalized there, seems to be tailored for
reducibilities in computational environments such as distributed or
parallel computation. This is so since it seems typical to require
from distributed or parallel procedures a many-parameter optimization.
If execution reducibilities are applied successfully they can form a
correspondence between many parameters simultaneously. Theorems 1 and
2 below suggest examples both of such requirements in parallel and
distributed computation and of the way they are handled by execution
reducibilities.

Section  3  shows  how  to  exchange  the  EREW  PRAM  of  Section  2  with  an 
SDC.  This  requires  a  few  interesting  simulations  of  some  synchronous 
distributed  machines    by  others. 

Section  4  discusses  possible   extensions    of   the   results. 

2. SIMULATIONS INTO EREW PRAM
2.1.      Models   of   computation 

In this section we deal with two models of computation, the F&A
PRAM and the EREW PRAM. Let us start with a less informal definition
of these models. The F&A PRAM (resp. EREW PRAM) employs a sequence of
(processors) RAM's $P_1, P_2, \dots, P_p$ (resp. $R_1, R_2, \dots, R_s$) operating
synchronously in parallel, where p (resp. s) is some positive integer.
Each individual RAM is similar to a standard one-processor RAM (cf.
[Aho, Hopcroft and Ullman 74, Chap. 1]). Each RAM has its own local
finite random-access memory and has instructions for typical arithmetic
and boolean operations, conditional branches based on typical binary
predicates, and reading and writing into its local memory. Each
processor $P_i$ has a specified private register which contains its number
i. The RAM's also have access to a finite common memory and each RAM
has instructions for accessing the common memory using its private
registers to specify the common memory address and other information
which is required for the specific access. The way simultaneous access
to the same common memory location is resolved is the main difference
between the two models. Without loss of generality we assume that
instructions are of the following forms. Let $M_1, M_2, \dots$ denote local
memory registers and res (result) and op (operand) be positive
integers. In order to simplify the presentation, we assume that each
cycle of the F&A PRAM includes three subcycles, called pulses. (We
come back to the EREW PRAM later.) Each processor may perform any of
the instructions that relate only to its local memory in the first
pulse. The second pulse is called a reading pulse. In each such pulse
each processor may perform one instruction of the following form:

$M_{res} \leftarrow \mathrm{In}\,M_{op}$ (indirect common memory READ).

The contents of the common location whose address is in local
register $M_{op}$ are read into local register $M_{res}$. The third pulse is
called the Fetch and Add (F&A) pulse. In each such pulse each
processor may perform one instruction of the following form:

$\mathrm{In}\,M_{res} \leftarrow M_{op}$ (indirect common memory F&A).

All processors for which $M_{res}$ have identical contents (they refer
to the same common memory address) are virtually sorted in some order.
Then, the contents of $\mathrm{In}\,M_{res}$ and $M_{op}$ for all these processors will be
as if the following were performed sequentially.
For the first processor:

$M_{aux} \leftarrow M_{op}$ ($M_{aux}$ is an auxiliary local register)
$M_{op} \leftarrow \mathrm{In}\,M_{res}$
$\mathrm{In}\,M_{res} \leftarrow M_{aux} + \mathrm{In}\,M_{res}$

Then, the same is done for the second processor; then, for the third,
and so on.

By  convention,  the  machine  halts  if  an  indirect  READ,  WRITE,  or  an 
F&A  refers  to  a  nonexistent  address.  This  completes  the  description  of 
the  F&A    PRAM. 

A cycle of the EREW PRAM may include all the instructions of the
first and second pulses above as well as instructions of the form

$\mathrm{In}\,M_{res} \leftarrow M_{op}$ (indirect common memory WRITE).

The contents of local register $M_{op}$ are written into the common
location whose address is in $M_{res}$.

By  convention,  the  EREW  PRAM  halts  if  an  indirect  READ  or  WRITE 
refers  to  a  nonexistent  address  or  if  two  processors  refer 
simultaneously  to   the   same  memory   location. 
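The halting convention of the EREW PRAM can be made concrete with a small sketch of one common-memory cycle. The representation is an assumption made for illustration: the common memory is a dictionary from addresses to contents, and each access is a tuple `(processor, kind, address, value)`; `EREWHalt` and `run_erew_cycle` are hypothetical names.

```python
from collections import Counter


class EREWHalt(Exception):
    """Raised where the model says the EREW PRAM halts."""


def run_erew_cycle(common, accesses):
    """Perform one cycle of common-memory accesses under the EREW rules.

    accesses: list of (processor, kind, address, value), kind is "read" or "write".
    Returns a dict mapping each reading processor to the content it read.
    """
    counts = Counter(address for _, _, address, _ in accesses)
    for _, _, address, _ in accesses:
        if address not in common:
            raise EREWHalt("nonexistent address %r" % (address,))
        if counts[address] > 1:
            raise EREWHalt("simultaneous access to %r" % (address,))
    results = {}
    for processor, kind, address, value in accesses:
        if kind == "read":
            results[processor] = common[address]
        else:
            common[address] = value
    return results


common = {0: 7, 1: 9}
print(run_erew_cycle(common, [(1, "read", 0, None), (2, "write", 1, 4)]), common)
```

Because every address is touched by at most one processor in a cycle, the order in which the accesses are applied inside the function does not affect the outcome.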

Denote by con($M_i$) the content of memory register $M_i$. Conventions
regarding input data and the instructions of the algorithm that is
simulated are not critical. This is because we assume below that a
simulation for a given set of F&A PRAM instructions is given and then
extend it to more instructions. Any convention for the restricted set
will be readily applicable to the extended set.

2.2.      The  main   theorem 

We show below that a simulation of an F&A pulse by the EREW PRAM
does not require 'significantly' more time and space than a
corresponding simulation of a corresponding reading pulse by the EREW
PRAM (there are two simultaneous correspondences above: between
simulations and between pulses). It should be noted that the first and
second pulses of the F&A PRAM form an even weaker model than the CREW
PRAM since these pulses do not enable the processors to write into the
common memory.

Definition. Suppose that an algorithm A for the F&A PRAM runs in Z
pulses (numbered 1, 2, ..., Z) on some input I. Suppose that this input
is located initially at some (local or common) memory locations. Let
$M = \{m_1, m_2, \dots, m_g\}$ be the set of all local and common memory
addresses of the F&A PRAM. Let f be a one-to-one function
f: M -> {memory addresses of EREW PRAM}. We say that an EREW PRAM
simulated pulse z, $1 \le z \le Z$, with respect to the function f, from time
$t_1$ to time $t_2$, if con($m_j$) at the beginning (resp. at the end) of pulse z
is the same as con($f(m_j)$) of the EREW PRAM at time $t_1$ (resp. $t_2$), for
every $m_j$ in M. We say that an EREW PRAM simulated algorithm A on input
I from time $T_1$ to time $T_2$ if there exists a one-to-one function f (with
domain and range as before) and a sequence
$T_1 = t_0 < t_1 < \dots < t_Z = T_2$ such that the EREW PRAM simulated pulse z
with respect to f, from time $t_{z-1}$ to time $t_z$, for every $1 \le z \le Z$.
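The definition of simulating a single pulse amounts to comparing memory contents through f at the two end points. A minimal sketch, assuming memories are represented as dictionaries from addresses to contents and f as a dictionary from F&A PRAM addresses to EREW PRAM addresses (all names hypothetical):

```python
def simulates_pulse(fa_before, fa_after, erew_at_t1, erew_at_t2, f):
    """Check con(m) == con(f(m)) at the start and at the end of a pulse.

    fa_before / fa_after: F&A PRAM contents at the beginning / end of the pulse.
    erew_at_t1 / erew_at_t2: EREW PRAM contents at times t1 / t2.
    """
    if len(set(f.values())) != len(f):
        raise ValueError("f must be one-to-one")
    return all(
        fa_before[m] == erew_at_t1[f[m]] and fa_after[m] == erew_at_t2[f[m]]
        for m in f
    )


f = {"A": 100, "m1": 7}
print(simulates_pulse({"A": 10, "m1": 3}, {"A": 13, "m1": 10},
                      {100: 10, 7: 3, 8: 0}, {100: 13, 7: 10, 8: 99}, f))
```

EREW PRAM addresses outside the range of f (address 8 in the usage line) are unconstrained, which is what leaves room for auxiliary variables.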

We list below, in a formal fashion, assumptions which are needed
for the theorem that follows. This specification is somewhat lengthy.
However, we feel that any 'reasonable' simulation (like all the ones we
know) satisfies these assumptions and guess that this specification
cannot be shortened significantly.

Let $M_{res_1} \leftarrow \mathrm{In}\,M_{op_1}$, $M_{res_2} \leftarrow \mathrm{In}\,M_{op_2}$, ..., $M_{res_k} \leftarrow \mathrm{In}\,M_{op_k}$ be the
indirect common memory read instructions of processors
$P_{i_1}, P_{i_2}, \dots, P_{i_k}$, respectively, which are active in a given reading
pulse of some algorithm for the F&A PRAM for some input ($1 \le i_\ell \le p$,
for $1 \le \ell \le k$). We assume that any simulation of this pulse with
respect to a function f into the EREW PRAM satisfies the following (say
that the simulation takes T time units of the EREW PRAM machine for some
integer T > 0):
(1) The copying assumption. For each (legitimate) indirect common
memory read instruction $M_{res_i} \leftarrow \mathrm{In}\,M_{op_i}$, $1 \le i \le k$, there exists a
sequence of (EREW PRAM) addresses a(i,0), a(i,1), ..., a(i,T) such that:

(a) a(i,0) = f(con($M_{op_i}$)) (where con($M_{op_i}$) was defined earlier to be
the common address which is the content of $M_{op_i}$)

(b) a(i,T) = f(local memory address $res_i$) (f($res_i$) in short)

(c) for each $1 \le t \le T$, either a(i,t-1) = a(i,t) and there is no
processor that writes into this address at time unit t, or some
processor copies the content of a(i,t-1) into a(i,t) at time unit t
(the time elapses from time t-1 to time t of the simulation). In other
words, it is assumed that the content of a memory location may be
broadcast only by repeatedly copying it. Note that the contents of
a(i,j) at time j, $0 \le j \le T$, is what we wish to finally read into
$M_{res_i}$.


(2) All (EREW PRAM) addresses $\ell$ such that $\ell = f(c)$ for some F&A PRAM
address c are 'marked'. For each address $\ell$ such that $\ell \ne f(res_i)$, for
all $1 \le i \le k$, there exists a sequence of EREW PRAM addresses $a(\ell,0)$,
$a(\ell,1)$, ..., $a(\ell,T)$ such that:

(a) $a(\ell,0) = a(\ell,T) = \ell$, and

(b) like (c) above.

This implies that none of these addresses may change its contents
following the simulation.

(3) We further assume that if f(con($M_{op_i}$)) $\ne$ f(con($M_{op_j}$)) at time 0,
for $0 < i, j \le k$, then a(i,t) $\ne$ a(j,t) for all possible sequences
corresponding to i or j and for all $1 \le t \le T-1$. This assumption
enables us to avoid possible difficulties in case the contents of the
two addresses f(con($M_{op_i}$)) and f(con($M_{op_j}$)) are equal. It is weaker,
however, than a possible assumption that the simulation 'does not
depend' on the contents of the common memory addresses.
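Condition (c) of the copying assumption can be checked mechanically on a recorded run of a simulation. The sketch below assumes a hypothetical trace format: `copies[t]` is the set of (source, destination) pairs copied at time unit t, and `writes[t]` is the set of all addresses written at time unit t, copy destinations included.

```python
def satisfies_copy_condition(path, copies, writes):
    """Check condition (c) for one address sequence a(i,0), ..., a(i,T).

    path: list [a(i,0), ..., a(i,T)].
    copies: dict t -> set of (source, destination) copied at time unit t.
    writes: dict t -> set of addresses written at time unit t.
    """
    T = len(path) - 1
    for t in range(1, T + 1):
        prev, cur = path[t - 1], path[t]
        stays = prev == cur and cur not in writes.get(t, set())
        copied = (prev, cur) in copies.get(t, set())
        if not (stays or copied):
            return False
    return True


copies = {2: {("c", "x")}, 3: {("x", "r")}}
writes = {2: {"x"}, 3: {"r"}}
print(satisfies_copy_condition(["c", "c", "x", "r"], copies, writes))
```

In the usage line the content of common address c waits one time unit, is copied to an intermediate address x, and is then copied to the target register r, so the sequence is accepted.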


Theorem 1. Let $\mathrm{In}\,M_{res_1} \leftarrow M_{op_1}$, $\mathrm{In}\,M_{res_2} \leftarrow M_{op_2}$, ..., $\mathrm{In}\,M_{res_k} \leftarrow M_{op_k}$ be
the indirect common memory F&A instructions of processors
$P_{i_1}, P_{i_2}, \dots, P_{i_k}$, respectively, which are active in a given F&A pulse
of some algorithm for the F&A PRAM for some input ($1 \le i_\ell \le p$, for
$1 \le \ell \le k$). Let $M_{op_1} \leftarrow \mathrm{In}\,M_{res_1}$, $M_{op_2} \leftarrow \mathrm{In}\,M_{res_2}$, ..., $M_{op_k} \leftarrow \mathrm{In}\,M_{res_k}$ be
the indirect common memory read instructions of processors
$P_{i_1}, P_{i_2}, \dots, P_{i_k}$, respectively, of a reading pulse. Suppose
(virtually) that this reading pulse replaces the F&A pulse and we are
given a simulation of the reading pulse by the EREW PRAM that requires
T time units, common memory of size y and local memories of sizes
$y_1, y_2, \dots, y_s$ with respect to some function f. Suppose, also, that this
simulation satisfies assumptions (1), (2) and (3) above. Then, it is
possible to simulate the F&A pulse by the EREW PRAM within O(T) time
and local and common memories of sizes proportional to $y_1+T$, $y_2+T$,
..., $y_s+T$, y, respectively, with respect to f.

Proof  of  Theorem  1.  The  general  idea  of  the  proof  is  to  virtually 
perform  the  simulation  of  the  reading  pulse.  Then,  we  perform  a  few 
operations  on  the  'execution  sequence'  of  this  simulation  to  form  a 
simulation  of  the  F&A  pulse.  This  implies  the  connection  to  the  notion 
of   execution  reducibility. 

At the time the F&A pulse had to be performed we execute,
instead, the reading pulse. This is done in auxiliary memory locations
without changing the contents of the 'primary' memory locations.
Denote by A the set of all local and common memory locations of the
EREW PRAM. In order to help us to easily derive the claims of the
theorem regarding the complexity of memory requirements we employ two
families of auxiliary variables:

(1) For each memory location a in A we maintain variables $a_1, a_2, \dots, a_\alpha$
where $\alpha$ is a constant that does not depend on a. These variables are
located at the same (local or common) memory in which address a is
located (each processor that has access to a may have access to any one
of these variables, as well).

The second family of variables reflects the performance of
processor $P_i$, $1 \le i \le s$, at time unit t of the reading pulse
simulation where t is an integer, $1 \le t \le T$. It is used later in time
units corresponding to t, in order to reconstruct this performance.

(2) For each such pair (i,t) we maintain variables $Q_{i,t,1}, Q_{i,t,2},
\dots, Q_{i,t,\beta}$ where $\beta$ is a constant that does not depend on t or i. These
variables are stored in the local memory of processor $P_i$.

Below, we give an overview of the simulation. A lower level
description of an implementation of certain parts of the simulation is
given in the appendix. However, I did not make an effort either to
decrease constants involved in the simulation and the complexity
evaluation, or to give an exact count of them when it was clear that
only constants may arise. This remark applies in particular to the
memory space required for the $a_j$ variables, since only a small portion
of them is used in a simulation of a single pulse. So, we do not
specify the values of the constants $\alpha$ and $\beta$.

The simulation of the F&A pulse includes four steps. Initially,
all auxiliary variables are undefined.

Step 1. The reading pulse simulation is performed with slight
modifications. For $1 \le i \le s$ and $1 \le t \le T$, whenever processor $P_i$
copies the contents of a memory location b into another memory location
c at time unit t: (a) this is recorded at the processor's auxiliary
variables $Q_{i,t,j}$, and (b) the origin (the address which contained this
content at the beginning of step 1) of the copied contents is recorded
into the auxiliary variables of memory location c.

The following (layered) directed graph G(V,E) is introduced as a
tool for specifying the synchronization of the remaining steps; edges
between two successive layers represent simultaneous events that take
place between two successive ticks of our 'clock'. This graph is used
also for explanation of these steps. The set of vertices V of G is
$V = \{(a,t) \mid a \in A,\ 0 \le t \le T\}$. Layer t, $L_t$, of G is
$L_t = \{(a,t) \mid a \in A\}$ for t, $0 \le t \le T$. Let

$E_1 = \{((a,t-1),(b,t)) \mid$ a processor $P_i$, for some $1 \le i \le s$, copied the
contents of address a into address b at time unit t, $1 \le t \le T\}$

$E_2 = \{((a,t-1),(a,t)) \mid$ no processor wrote into address a at time
unit t, $1 \le t \le T\}$

The set of edges E is the union of $E_1$ and $E_2$. Note that all edges of
$E_1$ were actually recorded at the local memories of corresponding
processors. The rest of the simulation consists mostly of manipulating
the set $E_1$. An edge of the form ((a,t-1),(b,t)) is said to emanate
from vertex (a,t-1) (resp. layer t-1) and enter vertex (b,t) (resp.
layer t). Making the following observations regarding the graph G is
simple. Some trivial details are left to the reader.

(1) Every (a,t) in V, $1 \le t \le T$, has at most one entering edge. If it
exists it emanates from layer t-1.

(2) Every (a,t) in V, $0 \le t \le T-1$, has at most two edges that emanate
from it: at most one which is a member of $E_1$ and at most one which is
a member of $E_2$. In case both edges exist the vertex (b,t+1), such
that ((a,t),(b,t+1)) is in $E_1$, is called a right son of (a,t), while
(a,t+1) is the left son of (a,t); (a,t+1) is the left brother of
(b,t+1). For each j, $1 \le j \le k$: denote by $r_j$ the vertex in G which
corresponds to con($M_{res_j}$) at time t = 0 (it is (f(con($M_{res_j}$)),0));
and by $T_j$ the directed (layered) graph which is a subgraph of G and
includes vertex $r_j$ and all vertices and edges that are reachable from
$r_j$ along a directed path in G. Let $\bar{r}_j$ be the EREW PRAM address
corresponding to $r_j$.

(3) $T_j$ is a tree and if $r_i$ and $r_j$ are the same for $1 \le i, j \le k$ then $T_i$
and $T_j$ are the same.

(4) (f($op_j$),T), which corresponds to $M_{op_j}$ at time T, is a leaf of the
tree $T_j$. Let $T'_j$ be a tree obtained from $T_j$ by eliminating all
vertices (and edges) which are not on a path from the root $r_j$ to a leaf
of the form (f($op_i$),T), for i, $1 \le i \le k$. These $T'_j$ trees are found
below and utilized to compute the sums required for the F&A pulse
simulation.
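The graph G and observation (1) can be illustrated by constructing the two edge sets from a recorded run. The sketch assumes a hypothetical log format: `copy_log[t]` holds the (a, b) pairs copied at time unit t and `write_log[t]` holds every address written at time unit t (copy destinations included).

```python
def build_graph(addresses, T, copy_log, write_log):
    """Return (V, E1, E2) of the layered graph for a run of T time units."""
    vertices = {(a, t) for a in addresses for t in range(T + 1)}
    e1 = {((a, t - 1), (b, t))
          for t in range(1, T + 1)
          for (a, b) in copy_log.get(t, set())}
    e2 = {((a, t - 1), (a, t))
          for t in range(1, T + 1)
          for a in addresses
          if a not in write_log.get(t, set())}
    return vertices, e1, e2


def max_in_degree(edges):
    """Largest number of edges entering any single vertex."""
    counts = {}
    for _, head in edges:
        counts[head] = counts.get(head, 0) + 1
    return max(counts.values(), default=0)


V, E1, E2 = build_graph({"c", "x", "y"}, 2,
                        copy_log={1: {("c", "x")}, 2: {("c", "y")}},
                        write_log={1: {"x"}, 2: {"y"}})
print(len(V), sorted(E1), max_in_degree(E1 | E2))
```

A vertex (b,t) is entered by an $E_2$ edge only if nobody wrote into b at time unit t, and by an $E_1$ edge only if some processor did, which is why the in-degree never exceeds one when the log respects exclusive writes.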

Step (1) corresponds to moving down the auxiliary graph G (from
roots of the $T_j$ trees to their leaves), thereby keeping record of all
edges in $E_1$ in local memories of processors, as was mentioned above.

Step 2. A (virtual) movement up in layers of G is performed, in T
synchronous cycles. All edges of $E_1$ are visited by the same processor
that visited them in the corresponding time unit of step (1). This
up-sweeping of G is used to obtain two goals.

(a) finding the subtrees $T'_j$ by eliminating from $T_j$ all edges in $E_1$ that
should not be in these $T'_j$ subtrees.

(b) for each tree $T'_j$ and each (virtual) vertex (a,t) of $T'_j$ the sum of
its descendant leaves is computed. Denote this sum by S(a,t).

Step 3. A synchronous down-sweep is performed in order to compute at
each vertex (a,t) of a tree $T'_j$ the sum of the contents of $\bar{r}_j$ and all
leaves of $T'_j$ which are descendants of left brothers of ancestors of
(a,t). This amounts to a standard computation of partial sums by a
binary tree.
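The up-sweep of Step 2(b) and the down-sweep of Step 3 can be sketched on an explicit tree. The sketch is sequential and recursive, whereas the simulation proceeds layer by layer in T synchronous cycles; it also assumes that each leaf carries the integer operand of one processor. The class `Vertex` and its field names are hypothetical.

```python
class Vertex:
    def __init__(self, left=None, right=None, increment=0):
        self.left = left            # left son: same address, next layer (an E_2 edge)
        self.right = right          # right son: destination of a copy (an E_1 edge)
        self.increment = increment  # used at leaves only: operand of one processor
        self.subtree_sum = 0        # S(a,t), filled in by the up-sweep
        self.prefix = None          # partial sum, filled in by the down-sweep


def up_sweep(v):
    """Compute at every vertex the sum of its descendant leaves."""
    if v.left is None and v.right is None:
        v.subtree_sum = v.increment
    else:
        v.subtree_sum = sum(up_sweep(c) for c in (v.left, v.right) if c is not None)
    return v.subtree_sum


def down_sweep(v, prefix):
    """Deliver to v the root content plus all leaves lying to its left."""
    v.prefix = prefix
    if v.left is not None:
        down_sweep(v.left, prefix)
    if v.right is not None:
        left_sum = v.left.subtree_sum if v.left is not None else 0
        down_sweep(v.right, prefix + left_sum)


a, b, c = Vertex(increment=3), Vertex(increment=5), Vertex(increment=2)
root = Vertex(left=Vertex(left=a, right=b), right=Vertex(right=c))
old_content = 10
new_content = old_content + up_sweep(root)
down_sweep(root, old_content)
print([leaf.prefix for leaf in (a, b, c)], new_content)
```

The values delivered to the leaves are the partial sums an F&A pulse would hand to the corresponding processors when the permutation is the left-to-right order of the leaves, and `new_content` is the value to be stored at the common address.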

Step 4. Re-initialize all auxiliary addresses. Initializing the $Q_{i,t,j}$
variables should not cause any trouble since we know where they are
located. There might be some difficulty, however, in locating the $a_j$
variables that were updated through various steps. This can be
overcome by reconstructing steps 1-3 as follows. During the algorithm
keep for any auxiliary address a counter which counts the number of
times it is being accessed. Now, after step 3 perform the algorithm
again and whenever an auxiliary address is accessed for the last
time (this can be found by employing a second counter for each
auxiliary address and comparing it to the first) initialize it.
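The two-counter idea of Step 4 can be sketched as follows, under the simplifying assumption that the accesses of steps 1-3 are available as a flat sequence of auxiliary addresses that can be replayed; `None` stands for 'undefined', and the function name is hypothetical.

```python
def reinitialize(aux, access_trace):
    """Reset every accessed auxiliary address at its last access in the replay.

    aux: dict mapping auxiliary addresses to their contents.
    access_trace: auxiliary addresses in the order steps 1-3 accessed them.
    """
    total = {}
    for address in access_trace:      # first run: count accesses per address
        total[address] = total.get(address, 0) + 1
    seen = {}
    for address in access_trace:      # replay: a second counter per address
        seen[address] = seen.get(address, 0) + 1
        if seen[address] == total[address]:
            aux[address] = None       # last access reached: make it undefined
    return aux


print(reinitialize({"u": 1, "v": 2, "w": 3}, ["u", "v", "u"]))
```

Addresses that were never accessed (here w) are left alone, so the cost of the clean-up is proportional to the work already done rather than to the total number of auxiliary variables.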

Correctness of the simulation can now be established by examining
the change in (common or local) addresses that fall in the range of the
function f. The claims regarding complexity of time and sizes of
memories in Theorem 1 follow readily.

3. EXTENSION TO SDC

The main goal of this section is to show how the solution of the
previous section implies an analogue to Theorem 1 where the EREW PRAM
is replaced by a Synchronous Distributed Computer. Our definition of a
Synchronous Distributed Computer resembles, to some extent, the
definition of a parallel computer in [Galil and Paul 81] and the
definition of a synchronous distributed computer in [Vishkin 82]. Note
that the word 'parallel' in the former paper corresponds to
'distributed' in our taxonomy. The definitions and discussion of this
section are more informal.

The distributed computation model: The model of synchronous distributed
computation (SDC) that we employ has a sequence of RAM's $P_1, \dots, P_s$.
Each individual RAM is essentially identical to a processor of the EREW
PRAM as long as the operations the latter performs relate only to its
local memory. All processors operate synchronously in parallel. They
can be thought of as nodes (vertices) that are connected by fixed lines
(edges), forming a graph of communication. Each processor has an
additional instruction that serves as the main communication tool. The
information to be communicated is loaded into a communication register
which is attached to the adjacent line on which this information is to
be transmitted. The processor on the other side of the line may read
this register. We further assume that the degree of each vertex of our
graph of communication does not exceed a constant, say c. The following
are similar to the previous section: the reasons why input
conventions of both data and the instructions of the algorithm being
simulated are not critical; the definition of what is meant by saying
that an SDC simulated a pulse (resp. an algorithm A on an input I) of
the F&A PRAM with respect to a function f; and assumptions (1), (2) and
(3).

Theorem 2. Let $M_{M_{res_1}} \leftarrow M_{op_1}, \ldots, M_{M_{res_k}} \leftarrow M_{op_k}$
be the indirect common memory F&A instructions of processors
$P_{i_1}, P_{i_2}, \ldots, P_{i_k}$, respectively, which are active in a
given F&A pulse of some algorithm for the F&A PRAM, on some input
($1 \le i_l \le p$, for $1 \le l \le k$). Let
$M_{op_1} \leftarrow M_{M_{res_1}}, \ldots, M_{op_k} \leftarrow M_{M_{res_k}}$
be the indirect common memory read instructions of processors
$P_{i_1}, P_{i_2}, \ldots, P_{i_k}$, respectively, of a reading pulse.
Suppose (virtually) that this reading pulse replaces the F&A pulse and
we are given a simulation of the reading pulse by the SDC that requires
T time units and local memories of sizes $y_1, y_2, \ldots, y_p$ and
satisfies assumption (1). Then it is possible to simulate the F&A pulse
by the SDC within O(T) time and local memories of size proportional to
$y_1+T, y_2+T, \ldots, y_p+T$, respectively, with respect to f.
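
To make the object of the theorem concrete, the following minimal sketch gives a serial reference semantics for one F&A pulse aimed at a single common memory cell. It assumes the usual Fetch-and-Add convention (each processor fetches the current value and then adds its increment, and concurrent instructions behave as if executed in some serial order); the function name `fetch_and_add_pulse` and the fixed serialization order are assumptions of this illustration, not notation from the paper.

```python
def fetch_and_add_pulse(cell_value, increments):
    """Serial reference semantics for one F&A pulse on a single common cell.

    increments[l] is the increment supplied by the l-th active processor.
    Returns (values fetched by each processor, final cell value), using the
    serialization order 0, 1, ..., k-1 (an assumption of this sketch).
    """
    fetched = []
    for inc in increments:
        fetched.append(cell_value)  # the processor fetches the current value
        cell_value += inc           # and then adds its own increment
    return fetched, cell_value


fetched, final = fetch_and_add_pulse(10, [3, 1, 4])
print(fetched, final)  # [10, 13, 14] 18
```

Any simulation of the F&A pulse has to hand each active processor one of these prefix sums and leave the total in the common cell.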
A full proof of Theorem 2 follows essentially the same lines as
the proof of Theorem 1. We sketch below the main modifications required
in the proof of Theorem 1 to form a proof of Theorem 2. Details are
left to the reader.

Proof of Theorem 2. Let MSDC be a slightly modified SDC. The only
difference between the two machines is that the MSDC allows several
communication registers (rather than one) to be associated with each
line of the graph of communication. Now observe that:

(1)  Each  of  the  MSDC  and  the  SDC  is  a  restricted  EREW  PRAM.  This  is  in 
the  sense  that  a  communication  register  of  each  of  these  machines  can 
be  thought  of  as  a  common  memory  location  with  the  restriction  that 
only  two  specified   processors   may  access   it. 

(2) The proof of Theorem 1 presents an execution reducibility which
formed an F&A pulse simulation by modifying a reading pulse simulation.
Notice that throughout the F&A pulse simulation the processors of the
EREW PRAM may access only four kinds of memory locations:

(a) Primary memory locations that were accessed by the same processor
at the reading pulse simulation.

(b) Auxiliary memory locations which are located in the local memory of
a processor and associated with primary memory locations of the same
local memory. They need to be accessed only by the processor.

(c) Auxiliary memory locations which are located in local memory
locations and are used only by the corresponding processor for the
purpose of reconstructing its previous moves (the $Q_{i,t}$ variables).

(d) Auxiliary common memory locations. Each is associated with a
primary common memory location and can be accessed only by processors
that access this primary memory location.

We come back later to the way this observation is used.

(3) Let 1, 2, ..., d be the nodes of a graph of communication and suppose
that an MSDC employs processors $P'_1, \ldots, P'_d$ at nodes 1, 2, ..., d,
respectively, of this graph. Suppose that the SDC employs processors
$P_1, \ldots, P_d$ at the same nodes, respectively, and the edges of the
graph of communication are the same. Then the SDC can simulate the MSDC
as follows:

(a) The time does not increase by more than a constant factor.

(b) Let m(e) be the number of communication registers which are
associated with the edge e of the communication graph in the MSDC. Say
that the local memories of processors $P'_1, \ldots, P'_d$ of the MSDC are
of sizes $y_1, \ldots, y_d$, respectively. Then the local memories of these
processors in the SDC are of sizes $y_1+m(1), y_2+m(2), \ldots, y_d+m(d)$,
where $m(i) = \sum_e m(e)$; the sum is over all edges e adjacent to node i,
for i = 1, 2, ..., d, in the communication graph. Vizing's edge coloring
theorem [Berge 73] implies that the edges of the communication graph
can be partitioned into c+1 sets, where no two edges of the same set
share a vertex. (c is the valence of the graph of communication.) So,
partition each cycle of the MSDC into c+1 pulses of the SDC; each set
of edges is assigned to a pulse. Each local memory location of the
MSDC has a corresponding memory location in the local memory of the
corresponding processor of the SDC. Each communication register of the
MSDC which is assigned to an edge e = (i,j) has a corresponding memory
location at the local memory of processor $P_i$ (resp. $P_j$) in the SDC
(where i < j). The above correspondence is one to one. Let e = (i,j)
be an edge, as above, which is assigned to pulse k, $1 \le k \le c+1$, of
the SDC. Each pulse of the SDC consists of four subpulses.

If processor $P'_i$ (resp. $P'_j$) writes into, or reads from, its local memory
   then processor $P_i$ (resp. $P_j$) does the same into the corresponding
        memory location at the first (resp. third) subpulse
else if processor $P'_i$ (resp. $P'_j$) reads a communication register
   then processor $P_i$ (resp. $P_j$) reads the corresponding local memory
        location at the first subpulse
else if processor $P'_i$ (resp. $P'_j$) writes into a communication register
   then processor $P_i$ (resp. $P_j$) writes the same, specifying the name
        of the original communication register, into the communication
        register of the same edge at the first (resp. third) subpulse.
        Processor $P_j$ (resp. $P_i$) copies the contents written into the
        communication register to the corresponding local memory at the
        second (resp. fourth) subpulse.

We assume that in the original cycle no two processors tried to
write simultaneously into the same communication register. It should
be obvious how to change the simulation for alternate assumptions
regarding the way such write conflicts are resolved. The simulation is
correct and satisfies the claims concerning complexity of time and
space (size of memories). The simple idea of using Vizing's theorem
for regulating the synchronization of a synchronous distributed machine
may also be found in [Vishkin 82].
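
The scheduling idea of observation (3) can be sketched as follows. The sketch partitions the edges of a communication graph into classes in which no two edges share a vertex, so that each class can be served in its own pulse. It deliberately swaps in a simple greedy rule rather than the construction behind Vizing's theorem: for valence c the greedy rule may need up to 2c-1 classes instead of c+1, which is still a constant. The function name `greedy_edge_classes` and the edge-list input format are assumptions of this illustration.

```python
def greedy_edge_classes(edges):
    """Partition edges into classes so that no two edges in a class share a vertex.

    Greedy rule: give each edge the smallest class index not yet used at
    either endpoint. For maximum degree c this uses at most 2c - 1 classes
    (Vizing's theorem guarantees that c + 1 suffice, via a subtler argument).
    """
    used = {}     # vertex -> set of class indices already used at that vertex
    classes = []  # classes[k] = list of edges scheduled in pulse k
    for u, v in edges:
        taken = used.setdefault(u, set()) | used.setdefault(v, set())
        k = 0
        while k in taken:
            k += 1
        if k == len(classes):
            classes.append([])
        classes[k].append((u, v))
        used[u].add(k)
        used[v].add(k)
    return classes


# A 4-cycle has valence 2; the greedy rule happens to find 2 classes here.
pulses = greedy_edge_classes([(1, 2), (2, 3), (3, 4), (4, 1)])
print(pulses)  # [[(1, 2), (3, 4)], [(2, 3), (4, 1)]]
```

Within one class every processor talks over at most one line, which is what allows a single communication register per line to suffice during that pulse.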

Let us go back to the proof of the theorem. By the proof of
Theorem 1 and observations 1 and 2 we obtain from a simulation of the
reading pulse into the SDC a simulation of the F&A pulse into the MSDC.
Then applying observation 3 gives us a simulation of the F&A pulse into
the SDC.

Complexity Evaluation. We do not use more than a constant number of
communication registers for each line in the simulation into the MSDC.
This, together with the constant valence of the communication graph,
implies that the simulation of the MSDC into the SDC does not add more
than a constant to the size of each local memory. Therefore, our claims
regarding the size of the local memories follow from Theorem 1. The
claim about the time required for the simulation follows the same
lines. This completes the proof of Theorem 2.
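
The bookkeeping behind this evaluation can be written out as a short worked bound. The symbol r, standing for the constant bound on the number of communication registers per line in the MSDC simulation, is introduced here for illustration only; the factor 4(c+1) counts the c+1 pulses of four subpulses each that replace one MSDC cycle in observation (3).

```latex
% r: assumed name for the constant bound on registers per line (not in the source)
\[
  m(i) \;=\; \sum_{e \ni i} m(e) \;\le\; c \cdot r \;=\; O(1),
  \qquad
  \underbrace{O(y_i + T)}_{\text{MSDC, by Theorem 1}} \;+\; m(i) \;=\; O(y_i + T).
\]
\[
  \text{time on the SDC} \;\le\; 4\,(c+1) \cdot O(T) \;=\; O(T).
\]
```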

Remark.  Observation  (3)  justifies  the  choice  of  the  SDC  rather  than 
the  MSDC  as  a  model  of  synchronous  distributed  computation  and  might  be 
of    independent   interest. 

4.    EXTENSIONS   OF  THE  RESULTS 

There  are  a  few  things  that  can  be  said  about  possible  or 
apparently  possible  extensions    of   the   main  result. 

(1)  Replacing  the  F&A  instruction  by  an  F&*  instruction  for  any 
associative  and  commutative  binary  operation  *  .  The  proofs  of  the 
theorems   can  be  readily  adapted    to  the   change. 

(2) For some constant d, a number d of commutative and associative binary
operations $*_1, *_2, \ldots, *_d$ and corresponding F&$*_1$, F&$*_2$, ...,
F&$*_d$ instructions are introduced. Break each cycle of the abstract
model of parallel computation into d+2 pulses: two for local memory and
common memory read operations, and d F&$*_i$ pulses ($1 \le i \le d$). A
proof similar to ours follows.

(3) Simultaneous write instructions into the common memory by several
processors are allowed in the abstract model of computation. Say that
the abstract model uses the convention that, if more than one processor
seeks to write into the same memory location, then the smallest-numbered
processor succeeds. (This convention is used in [Goldschlager 78],
[Stockmeyer and Vishkin 82] and [Vishkin 81].) It can be readily seen
that an F&Min instruction can simulate this convention. Therefore, our
results apply to models that use this convention as well.

(4) Open Problem. Assume that a weaker abstract model of computation
than the CREW PRAM is used, for instance an EREW PRAM with some
"reasonable" default assumption regarding the way common memory read
and write conflicts are resolved. Can it be proved that it is possible
to extend 'efficiently' any simulation of such an abstract model of
computation into any SDC machine into a simulation of the F&A PRAM into
the same machine?
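
Extensions (1) and (3) can be illustrated together. The sketch below generalizes a serial reference semantics of an F&A pulse to an arbitrary binary operation, and then uses F&Min on an arbiter cell to pick the smallest-numbered writer. The two-phase arbitration scheme, the arbiter cell, and the names `fetch_and_op_pulse` and `priority_write` are assumptions made for this illustration; the text only states that such a simulation is readily seen to exist.

```python
def fetch_and_op_pulse(cell_value, operands, op):
    """Serial reference semantics of one F&* pulse on a single cell.

    op should be associative and commutative, so that the final cell
    value does not depend on the serialization order chosen here.
    """
    fetched = []
    for x in operands:
        fetched.append(cell_value)      # fetch the current value
        cell_value = op(cell_value, x)  # then combine the operand into the cell
    return fetched, cell_value


def priority_write(requests):
    """requests: list of (processor_index, value) aimed at one memory cell.

    Returns the value that ends up written when the smallest-numbered
    processor succeeds. Phase 1 (assumed): every writer applies F&Min with
    its own index to an arbiter cell. Phase 2: only the winner writes.
    """
    _, winner = fetch_and_op_pulse(float("inf"), [i for i, _ in requests], min)
    for i, value in requests:
        if i == winner:
            return value
    return None  # no write requests were issued


print(fetch_and_op_pulse(0, [5, 2, 7], max))           # ([0, 5, 5], 7)
print(priority_write([(4, "d"), (2, "b"), (9, "z")]))  # b
```

Because min is commutative and associative, the winner is the same whatever order the F&Min instructions are serialized in.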

Appendix

Step 1.

The modified reading pulse simulation is performed in such a way
that at the first access to address a the variable $a_1$ (resp. $a_2$) is
'turned on' and from this access on it plays the role of a in the
reading pulse simulation (resp. it contains the name of the address b,
such that its contents (at time t = 0) was copied and is presently
assigned to $a_1$). If $a_1$ contains a value (other than 'undefined') which
is not associated with such an address b then $a_2$ is set to 'nil'.

The main operations performed during the modified reading pulse
simulation are:

$t \leftarrow 1$
while $t \le T$ do
    for processor $P_i$, $1 \le i \le s$ do
        perform (with respect to the $a_1$ variables) the
        same as in the reading pulse simulation
        if the operation was of the form "copy the
           contents of address b into address c"
        then $Q_{i,t,1} \leftarrow b$; $Q_{i,t,2} \leftarrow c$; $Q_{i,t,3} \leftarrow b_2$; $c_2 \leftarrow b_2$
             (* keep record of the operation *)
        else if something was assigned into an address $a_1$
             then $a_2 \leftarrow$ 'nil'
    od
    $t \leftarrow t+1$ (simultaneously for all processors)
od

This  concludes   the   first  step   of   the   F&A   pulse  simulation. 
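
The record keeping of step 1 can be sketched in a few lines. The sketch replays a schedule of copy operations while maintaining, for every address, a working copy (the role of the $a_1$ variables), an origin tag naming the address whose time-0 contents it currently holds (the role of $a_2$), and a log of the copies performed (the role of the Q variables). It is a simplified reading of the step: the 'nil' case is omitted, and the function name `record_copies` and the dictionary layout are assumptions of this illustration.

```python
def record_copies(memory, schedule):
    """Replay copy operations and record where every value came from.

    memory:   dict address -> initial contents (time t = 0)
    schedule: schedule[t-1] is a list of (i, b, c) meaning that at time t
              processor i copies the contents of address b into address c
    """
    value = dict(memory)             # working copies (role of a_1)
    origin = {a: a for a in memory}  # origin tags (role of a_2)
    log = {}                         # (i, t) -> (b, c, origin of b)
    for t, ops in enumerate(schedule, start=1):
        # All processors read before anyone writes (synchronous step).
        reads = [(i, b, c, value[b], origin[b]) for i, b, c in ops]
        for i, b, c, v, o in reads:
            log[(i, t)] = (b, c, o)  # keep record of the operation
            value[c] = v
            origin[c] = o
    return value, origin, log


value, origin, log = record_copies(
    {"x": 7, "p": 0, "q": 0},
    [[(1, "x", "p")], [(2, "p", "q")]],
)
print(origin["q"], log[(2, 2)])  # x ('p', 'q', 'x')
```

The log entries are the edges ((b,t-1),(c,t)) along which the later steps accumulate and distribute sums.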

Step 2. Step 2 is performed in such a way that at the first access to
address a, the variable $a_3$ is 'turned on'.

At the beginning of step 2, if a is marked (there is an F&A PRAM
address c such that f(c) = a) and $a_2$ contains a name of an address
other than a (this assures that a is $f(op_j)$ for some $1 \le j \le k$
because of our assumptions) then $a_3$ is assigned the contents of a; it
is 'undefined' otherwise. If the virtual vertex (a,t) of the present
time unit t, $1 \le t \le T$, belongs to some tree $T'_j$ then $a_3$ contains
the sum S(a,t), and the variable $a_2$ contains the name of the address
which corresponds to the root of the tree $T'_j$.

Let us now describe step 2. Again, we avoid mentioning the
turning on of variables, except for the following important procedure.
Procedure 1: We record in the $Q_{i,t}$ variables the turning on of any of
the two end points if such a 'turned on' end point is $f(op_j)$ for some
j, $1 \le j \le k$. (See the criterion for turning on $a_3$.)

$t \leftarrow T$
while $t > 0$ do
    for processor $P_i$, $1 \le i \le s$, such that
        its $Q_{i,t}$ variables indicate the existence of
        a corresponding edge in $E_j$ do
        (Denote $b \leftarrow Q_{i,t,1}$ and $c \leftarrow Q_{i,t,2}$.)
        if con($c_3$) $\ne$ 'undefined'
        then if con($Q_{i,t,3}$) = con($b_2$)
             then $Q_{i,t,4} \leftarrow b_3$; $b_3 \leftarrow b_3 + c_3$
                  (* The sum in $b_3$ is stored for later use
                     in step 3. con($Q_{i,t,4}$) $\ne$ 'undefined'
                     also implies that (b,t-1) has two sons in $T'_j$. *)
             else $b_3 \leftarrow c_3$; $b_2 \leftarrow Q_{i,t,3}$
             $c_3 \leftarrow$ 'undefined' (* since $c_3$ already voted
                                  for the corresponding sum *)
        else reset $Q_{i,t,1}$, $Q_{i,t,2}$ and $Q_{i,t,3}$ to 'undefined'
             (* erase the edge ((b,t-1),(c,t)) from a
                corresponding $T_j$ to form $T'_j$ *)
    od
    $t \leftarrow t-1$ (* all processors at once *)
od
Step 3. The partial sums are computed into $a_4$. Step 3 is performed in
such a way that at a first access to address a the variable $a_4$ is
turned on. If a is marked, $a_2$ contains the address a and $a_3 \ne$
'undefined' then $a \leftarrow a_3 + a$ and $a_4 \leftarrow a_3$ (a corresponds to
a root of a $T'_j$ tree); otherwise $a_4 \leftarrow$ 'undefined'.

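$t \leftarrow 1$
while $t \le T$ do
    for processor $P_i$, $1 \le i \le s$, such that the
        $Q_{i,t}$ variables indicate an edge do
        (Denote $b \leftarrow Q_{i,t,1}$ and $c \leftarrow Q_{i,t,2}$.)
        if con($Q_{i,t,4}$) = 'undefined'
        then $c_4 \leftarrow b_4$
        else $c_4 \leftarrow b_4 + Q_{i,t,4}$
    od
    $t \leftarrow t+1$ (* all processors at once *)
od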

To finish step 3 we have to add

Procedure 2. Whenever the $Q_{i,t}$ variables indicate that procedure 1 was
performed relative to any of the end points of the corresponding edge,
assign $a \leftarrow a_4$ for such end points. (It is a leaf of a $T'_j$ tree.)